AITopics | temporal segment

Collaborating Authors

temporal segment

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM

Neural Information Processing SystemsFeb-16-2026, 18:24:31 GMT

However, current Vid-LLMs struggle to simultaneously retain high-quality frame-level semantic information ( i.e., a sufficient

artificial intelligence, large language model, natural language, (17 more...)

Neural Information Processing Systems

Country: Asia > China > Shanghai > Shanghai (0.04)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.67)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Weakly Supervised Dense Event Captioning in Videos

Xuguang Duan, Wenbing Huang, Chuang Gan, Jingdong Wang, Wenwu Zhu, Junzhou Huang

Neural Information Processing SystemsFeb-12-2026, 18:35:56 GMT

Among the wide variety of applications on video understanding, the video captioning task is attracting more and more interests in recent years [4, 5, 6, 7, 8, 9, 10, 11].

artificial intelligence, machine learning, video, (12 more...)

Neural Information Processing Systems

Country:

Asia > China > Beijing > Beijing (0.05)
North America > Canada > Quebec > Montreal (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.95)

Add feedback

792dd774336314c3c27a04bb260cf2cf-Supplemental.pdf

Neural Information Processing SystemsFeb-9-2026, 11:13:45 GMT

Finally,we train our model for 8hours on asingle V100GPU. We provide an illustration of our weakly supervised phrase grounding model in Figure 4b (this supplemental). Specifically,we create context-preserving negativecaptions for an image by substituting anoun in its original caption with negativenouns, that are sampled from apretrained BERT [17] model. Forexample,inthecase where only one cross-attention layer is used, adding the sentence-level contrastive loss leads to a 2.5%intheR@1accuracy. These videos contain transcribed narrations thatareeither uploaded manually byusersor aretheoutputofanautomatic speech recognition (ASR) system.

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.52)
Information Technology > Artificial Intelligence > Machine Learning (0.47)

Add feedback

Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing

Neural Information Processing SystemsDec-25-2025, 12:46:22 GMT

The audio-visual video parsing task aims to parse a video into modality-and category-aware temporal segments. Previous work mainly focuses on weakly-supervised approaches, which learn from video-level event labels. During training, they do not know which modality perceives and meanwhile which temporal segment contains the video event. Since there is no explicit grouping in the existing frameworks, the modality and temporal uncertainties make these methods suffer from false predictions. For instance, segments in the same category could be predicted in different event classes. Learning compact and discriminative multi-modal subspaces is essential for mitigating the issue. To this end, in this paper, we propose a novel Multi-modal Grouping Network, namely MGN, for explicitly semantic-aware grouping.

multi-modal grouping network, name change, weakly-supervised audio-visual video parsing, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.46)

Add feedback

Weakly Supervised Dense Event Captioning in Videos

Xuguang Duan, Wenbing Huang, Chuang Gan, Jingdong Wang, Wenwu Zhu, Junzhou Huang

Neural Information Processing SystemsNov-20-2025, 23:27:44 GMT

Wenwu Zhu is the corresponding author.

artificial intelligence, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Country:

Asia > China > Beijing > Beijing (0.04)
North America > Canada > Quebec > Montreal (0.04)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM

Neural Information Processing SystemsOct-10-2025, 10:09:42 GMT

However, current Vid-LLMs struggle to simultaneously retain high-quality frame-level semantic information ( i.e., a sufficient

arxiv preprint, language model, video, (15 more...)

Neural Information Processing Systems

Country: Asia > China > Shanghai > Shanghai (0.04)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.67)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Robust Group Anomaly Detection for Quasi-Periodic Network Time Series

Yang, Kai, Dou, Shaoyu, Luo, Pan, Wang, Xin, Poor, H. Vincent

arXiv.org Artificial IntelligenceJun-23-2025

Many real-world multivariate time series are collected from a network of physical objects embedded with software, electronics, and sensors. The quasi-periodic signals generated by these objects often follow a similar repetitive and periodic pattern, but have variations in the period, and come in different lengths caused by timing (synchronization) errors. Given a multitude of such quasi-periodic time series, can we build machine learning models to identify those time series that behave differently from the majority of the observations? In addition, can the models help human experts to understand how the decision was made? We propose a sequence to Gaussian Mixture Model (seq2GMM) framework. The overarching goal of this framework is to identify unusual and interesting time series within a network time series database. We further develop a surrogate-based optimization algorithm that can efficiently train the seq2GMM model. Seq2GMM exhibits strong empirical performance on a plurality of public benchmark datasets, outperforming state-of-the-art anomaly detection techniques by a significant margin. We also theoretically analyze the convergence property of the proposed training algorithm and provide numerical results to substantiate our theoretical claims.

artificial intelligence, data mining, machine learning, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TNSE.2022.3170364

2506.16815

Country:

Asia > China (0.69)
North America > United States > Minnesota (0.28)

Genre: Research Report (0.82)

Industry:

Information Technology (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Diagnostic Medicine (0.68)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Rethinking Word Similarity: Semantic Similarity through Classification Confusion

Zhou, Kaitlyn, Gao, Haishan, Chen, Sarah, Edelstein, Dan, Jurafsky, Dan, Shani, Chen

arXiv.org Artificial IntelligenceFeb-8-2025

Word similarity has many applications to social science and cultural analytics tasks like measuring meaning change over time and making sense of contested terms. Yet traditional similarity methods based on cosine similarity between word embeddings cannot capture the context-dependent, asymmetrical, polysemous nature of semantic similarity. We propose a new measure of similarity, Word Confusion, that reframes semantic similarity in terms of feature-based classification confusion. Word Confusion is inspired by Tversky's suggestion that similarity features be chosen dynamically. Here we train a classifier to map contextual embeddings to word identities and use the classifier confusion (the probability of choosing a confounding word c instead of the correct target word t) as a measure of the similarity of c and t. The set of potential confounding words acts as the chosen features. Our method is comparable to cosine similarity in matching human similarity judgments across several datasets (MEN, WirdSim353, and SimLex), and can measure similarity using predetermined features of interest. We demonstrate our model's ability to make use of dynamic features by applying it to test a hypothesis about changes in the 18th C. meaning of the French word "revolution" from popular to state action during the French Revolution. We hope this reimagining of semantic similarity will inspire the development of new tools that better capture the multi-faceted and dynamic nature of language, advancing the fields of computational social science and cultural analytics and beyond.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2502.05704

Country:

Europe (0.93)
North America > United States > California (0.28)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (0.68)

Industry: Government > Regional Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)

Add feedback

Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing

Neural Information Processing SystemsJan-19-2025, 02:51:12 GMT

The audio-visual video parsing task aims to parse a video into modality- and category-aware temporal segments. Previous work mainly focuses on weakly-supervised approaches, which learn from video-level event labels. During training, they do not know which modality perceives and meanwhile which temporal segment contains the video event. Since there is no explicit grouping in the existing frameworks, the modality and temporal uncertainties make these methods suffer from false predictions. For instance, segments in the same category could be predicted in different event classes.

multi-modal grouping network, temporal segment, weakly-supervised audio-visual video parsing, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.69)

Add feedback

Boosting Audio Visual Question Answering via Key Semantic-Aware Cues

Li, Guangyao, Du, Henghui, Hu, Di

arXiv.org Artificial IntelligenceJul-30-2024

The Audio Visual Question Answering (AVQA) task aims to answer questions related to various visual objects, sounds, and their interactions in videos. Such naturally multimodal videos contain rich and complex dynamic audio-visual components, with only a portion of them closely related to the given questions. Hence, effectively perceiving audio-visual cues relevant to the given questions is crucial for correctly answering them. In this paper, we propose a Temporal-Spatial Perception Model (TSPM), which aims to empower the model to perceive key visual and auditory cues related to the questions. Specifically, considering the challenge of aligning non-declarative questions and visual representations into the same semantic space using visual-language pretrained models, we construct declarative sentence prompts derived from the question template, to assist the temporal perception module in better identifying critical segments relevant to the questions. Subsequently, a spatial perception module is designed to merge visual tokens from selected segments to highlight key latent targets, followed by cross-modal interaction with audio to perceive potential sound-aware areas. Finally, the significant temporal-spatial cues from these modules are integrated to answer the question. Extensive experiments on multiple AVQA benchmarks demonstrate that our framework excels not only in understanding audio-visual scenes but also in answering complex questions effectively. Code is available at https://github.com/GeWu-Lab/TSPM.

proceedings, tspm, video, (13 more...)

arXiv.org Artificial Intelligence

2407.20693

Country:

Oceania > Australia > Victoria > Melbourne (0.05)
Asia > China > Beijing > Beijing (0.04)
North America > United States > New York > New York County > New York City (0.04)
(3 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.64)

Add feedback